Take Home Assignment - Origin Financial
Overview
Introduction
Engineering Raw Data
Before starting the data analysis, let's do a quick review of the available data sets and flag any pre-engineering needs.
Customers
## Rows: 803
## Columns: 7
## $ id <chr> "131d8363-e2a6-4c60-acef-da03e6bfbdc4", "831216da-744b-4…
## $ created_at <chr> "2020-03-11 19:38:35", "2020-05-01 13:09:00", "2020-03-1…
## $ date_of_birth <chr> "1990-08-16", "1933-11-22", "1986-02-02", "1990-01-01", …
## $ gender <chr> "male", "", "", "male", "", "", "", "male", "male", "mal…
## $ country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "U…
## $ state <chr> "CA", "WA", "NY", "CA", "", "CA", "", "CA", "NY", "CA", …
## $ city <chr> "San Francisco", "Tonasket", "New York", "San Francisco"…
A quick overview shows that the variables created_at and date_of_birth should be converted to timestamp and date types. We can also see that missing data is represented by empty strings, which will be replaced with NA.
The variables state and city contain strange values such as `<img src='#' onerror=alert('xss') />`. This looks like injected HTML (an attempted XSS payload) rather than real data, and these values will be replaced with NA.
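The cleaning steps above can be sketched with dplyr and lubridate; `customers` is an assumed object name, and the columns follow the glimpse output above.

```r
library(dplyr)
library(lubridate)

customers_clean <- customers %>%
  mutate(
    across(where(is.character), ~ na_if(.x, "")),  # empty strings -> NA
    across(c(state, city),
           # drop injected HTML: any value containing "<" is treated as missing
           ~ if_else(grepl("<", .x, fixed = TRUE), NA_character_, .x)),
    created_at    = ymd_hms(created_at),
    date_of_birth = ymd(date_of_birth)
  )
```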
Transactions
## Rows: 11,059
## Columns: 17
## $ X_id <chr> "4D2119A1-03D9-48F2-99B7-FEDD…
## $ user_id <chr> "94102846-0B6D-45D8-AFD1-DA80…
## $ account_id <chr> "88F8D694-AFF0-4CFC-BBB6-C89D…
## $ account_name <chr> "Chase - Plaid Money Market (…
## $ description <chr> "ACH Electronic CreditGUSTO P…
## $ type <chr> "expense", "expense", "expens…
## $ amount <dbl> -5850, -5850, -5850, -5850, -…
## $ date <chr> "2021-06-08T00:00:00.000Z", "…
## $ extra_fields.category.0 <chr> "Transfer", "Transfer", "Tran…
## $ extra_fields.category.1 <chr> "Debit", "Debit", "Debit", "D…
## $ extra_fields.category.2 <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.category_id <int> 21006000, 21006000, 21006000,…
## $ extra_fields.merchant_name <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.name <chr> "ACH Electronic CreditGUSTO P…
## $ extra_fields.payment_channel <chr> "other", "other", "other", "o…
## $ extra_fields.payment_meta.payment_method <chr> "ACH", "ACH", "ACH", "ACH", "…
## $ created_at <chr> "2021-06-18T19:14:32.075Z", "…
Exploratory Data Analysis
## Warning: Couldn't find skimmers for class: POSIXlt, POSIXt; No user-defined
## `sfl` provided. Falling back to `character`.
| Name | db_users |
| Number of rows | 803 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 36 | 36 | 0 | 803 | 0 |
| created_at | 0 | 1.00 | 2414 | 5633 | 0 | 803 | 0 |
| gender | 751 | 0.06 | 4 | 10 | 0 | 4 | 0 |
| country | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| state | 179 | 0.78 | 2 | 2 | 0 | 27 | 0 |
| city | 176 | 0.78 | 4 | 21 | 0 | 120 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_of_birth | 173 | 0.78 | 1000-01-01 | 2001-01-01 | 1989-12-12 | 119 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age_at | 173 | 0.78 | 33.56 | 40.21 | 19.41 | 29.63 | 31.08 | 33.99 | 1020.33 | ▇▁▁▁▁ |
- There is an outlier in date_of_birth (minimum of 1000-01-01), which produces an impossible age of over 1,000 years.
- gender has 751 missing values (complete rate of only 6%), which makes the variable of little use.
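One way to handle the date_of_birth outlier is to treat implausible birth dates as missing before computing age; this is a sketch, and the 1900 cutoff is an assumption.

```r
customers_clean <- customers_clean %>%
  mutate(
    # birth dates before 1900 are almost certainly data-entry errors
    date_of_birth = if_else(date_of_birth < as.Date("1900-01-01"),
                            as.Date(NA), date_of_birth),
    age_at = as.numeric(Sys.Date() - date_of_birth) / 365.25
  )
```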
Univariate Analysis
Age
State
## Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
## Need '+proj=longlat +datum=WGS84'
- More than half of the sample is either from CA or has no available state.
Users Transactions
- Only 84 customers have transactions.
## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.
| Number of Accounts | Number of Customers | % of Customers |
|---|---|---|
| 1 | 70 | 83.33 |
| 2 | 7 | 8.33 |
| 3 | 1 | 1.19 |
| 4 | 2 | 2.38 |
| 5 | 2 | 2.38 |
| 6 | 1 | 1.19 |
| 7 | 1 | 1.19 |
- Almost 20% of the customers with transactions have more than one account; for the analyses, however, I will aggregate at the customer id level.
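The accounts-per-customer table above can be reproduced with a sketch like the following; `transactions` is an assumed object name.

```r
library(dplyr)

transactions %>%
  distinct(user_id, account_id) %>%              # one row per customer-account pair
  count(user_id, name = "n_accounts") %>%        # accounts per customer
  count(n_accounts, name = "n_customers") %>%    # customers per account count
  mutate(pct = round(100 * n_customers / sum(n_customers), 2))
```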
Type of Transactions
- Expenses and salaries are mixed with transfers, and these should be analyzed separately.
- This is where the idea of an RFM transformation starts to take shape.
Transactions Category
| extra_fields_category_0 | extra_fields_category_1 | extra_fields_category_2 | extra_fields_category_id | n | perc |
|---|---|---|---|---|---|
| NA | NA | NA | NA | 2742 | 100 |
| Travel | Taxi | NA | 22016000 | 1230 | 100 |
| Food and Drink | Restaurants | Fast Food | 13005032 | 1057 | 100 |
| Travel | Airlines and Aviation Services | NA | 22001000 | 1055 | 100 |
| Food and Drink | Restaurants | NA | 13005000 | 1053 | 100 |
| Payment | Credit Card | NA | 16001000 | 623 | 100 |
| Food and Drink | Restaurants | Coffee Shop | 13005043 | 614 | 100 |
| Transfer | Credit | NA | 21005000 | 607 | 100 |
| Payment | NA | NA | 16000000 | 443 | 100 |
| Recreation | Gyms and Fitness Centers | NA | 17018000 | 443 | 100 |
| Shops | Sporting Goods | NA | 19046000 | 443 | 100 |
| Transfer | Debit | NA | 21006000 | 425 | 100 |
| Transfer | Deposit | NA | 21007000 | 324 | 100 |
- The categories will be explored further in the descriptive analysis, together with transaction counts and amounts.
Merchant Name
- This field is only populated for expense transactions.
Channel Transactions
- This field is only populated for expense transactions.
Payment Methods
- Only about 4% of transactions carry this field, so it is almost useless.
RFM Transformation
- Analyzing the problem and the clustering possibilities, I opted to transform the transactional data into recency, frequency, and monetary value (RFM).
The transformation follows two approaches: RFM by transaction type (expense, income, transfer) and RFM by expense category.
After transforming the transactional data, it is time to join it with the user data.
The final dataset has 84 customers and 36 variables, with some caveats.
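The RFM-by-type part of the transformation can be sketched as below; `transactions` is an assumed object name, and the output columns match the glimpse that follows.

```r
library(dplyr)
library(tidyr)

# anchor recency on the most recent transaction in the data
reference_date <- max(as.Date(transactions$date))

rfm_by_type <- transactions %>%
  mutate(date = as.Date(date)) %>%
  group_by(user_id, type) %>%
  summarise(
    recency_days      = as.numeric(reference_date - max(date)),  # R
    transaction_count = n(),                                     # F
    amount            = sum(abs(amount)),                        # M
    amount_mean       = mean(abs(amount)),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from  = type,
    values_from = c(recency_days, transaction_count, amount, amount_mean),
    values_fill = 0
  )
```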
## Rows: 84
## Columns: 36
## $ id <chr> "AD7226ED-2D26-45FD-AB37-C3823CD5DB0C", …
## $ state <chr> "NY", "CA", "IL", "NY", NA, "VA", "NY", …
## $ age_at <dbl> 34.10458, 32.52759, 33.34352, 29.65223, …
## $ recency_days_expense <dbl> 24, 30, 30, 31, 22, 0, 0, 0, 0, 0, 0, 0,…
## $ transaction_count_expense <dbl> 97, 61, 73, 197, 196, 0, 0, 0, 0, 0, 0, …
## $ amount_expense <dbl> 2825.37, 15654.90, 18785.88, 205553.34, …
## $ amount_mean_expense <dbl> 29.12753, 256.63770, 257.34082, 1043.417…
## $ recency_days_income <dbl> 39, 30, 30, 31, 39, 16, 16, 16, 16, 16, …
## $ transaction_count_income <dbl> 24, 91, 109, 26, 24, 3, 3, 3, 3, 3, 3, 3…
## $ amount_income <dbl> 12000.00, 585798.60, 702958.32, 25000.00…
## $ amount_mean_income <dbl> 500.0000000, 6437.3472527, 6449.1588991,…
## $ recency_days_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ transaction_count_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ amount_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ amount_mean_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ n_transactions_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ n_transactions_food_and_drink <dbl> 72, 0, 0, 246, 122, 0, 0, 0, 0, 0, 0, 0,…
## $ n_transactions_payment <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_transactions_recreation <dbl> 0, 0, 0, 50, 24, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_transactions_shops <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_transactions_travel <dbl> 73, 0, 0, 196, 98, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_transfercat <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ amount_food_and_drink <dbl> 2537.52, 0.00, 0.00, 53286.50, 27537.52,…
## $ amount_payment <dbl> 0.0, 0.0, 0.0, 99768.0, 51962.5, 0.0, 0.…
## $ amount_recreation <dbl> 0, 0, 0, 3925, 1884, 0, 0, 0, 0, 0, 0, 0…
## $ amount_shops <dbl> 0, 0, 0, 24000, 12500, 0, 0, 0, 0, 0, 0,…
## $ amount_travel <dbl> 12287.85, 0.00, 0.00, 49573.84, 24787.85…
## $ amount_mean_transfercat <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ amount_mean_food_and_drink <dbl> 35.24333, 0.00000, 0.00000, 216.61179, 2…
## $ amount_mean_payment <dbl> 0.0, 0.0, 0.0, 2078.5, 2078.5, 0.0, 0.0,…
## $ amount_mean_recreation <dbl> 0.0, 0.0, 0.0, 78.5, 78.5, 0.0, 0.0, 0.0…
## $ amount_mean_shops <dbl> 0, 0, 0, 500, 500, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_mean_travel <dbl> 168.3267, 0.0000, 0.0000, 252.9278, 252.…
## $ no_channel <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ other <dbl> 0.1655172, 0.0000000, 0.0000000, 0.16666…
## $ in_store <dbl> 0.8344828, 0.0000000, 0.0000000, 0.83333…
Expense vs Income
Here we can see customers with high income and low expenses, customers with no income and no expenses, customers with medium income and medium expenses, and customers with high expenses and medium income. This is already a hint for our clustering.
There is also an outlier in expenses.
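A minimal sketch of the expense vs. income scatter used here; `rfm_data` is an assumed object name holding the joined RFM dataset.

```r
library(ggplot2)

ggplot(rfm_data, aes(x = amount_income, y = amount_expense)) +
  geom_point(alpha = 0.6) +
  labs(x = "Total income", y = "Total expense",
       title = "Expense vs. Income per customer")
```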
Expense Vs Income Vs Age
## Warning: Removed 6 rows containing missing values (geom_point).
Age does not seem to influence transaction behavior in this sample with respect to income and expenses.
Expense vs Income vs Transfer
Transfer amounts do not seem related to either expenses or income, and there are few such transactions, so we will not use them in the clustering.
Relationship between types of expenses
## `summarise()` has grouped output by 'user_id', 'date'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'user_id'. You can override using the `.groups` argument.